Clustering Analysis and Ideal Point Estimation for Ranking Data, a Test of Spatial Models of Voting and Application to Democratic Party Ballot Data

Introduction

What is the best way to reduce the dimensionality of ranking data? Is clustering sufficient or can we do better and recreate the latent ideology space that political rankings come from?

These questions first cropped up in my work on racial polarization in ranked choice voting elections. Racial polarization metrics for plurality elections typically compare the race of respondents (or aggregated respondents) to their choice for a political office, but since ranked choice elections produce more than one choice, creating a comparable metric becomes difficult. This is particularly true because theories of ranked choice elections predict that they depolarize elections, but the mechanism by which this occurs is only relevant to choices lower on a voters list. In order to test these theories, we need a metric that condenses ranked choices into 1 variable while preserving what makes the rankings interesting. Clustering is the obvious solution, but given the desirability of a continuous variable and the success of ideal point estimation applications in similar settings, ideal point generating processes such as PCA are also tempting.

This project interogates these questions using simulated preference orderings to measure the accuracy of knn and pca models in R in replicating a set of clusters and the space of these clusters respectively, then applies what I learn to an actual ranked choice election: the 2020 Democratic primary, to see what the best techniques can tell us.

Trivial Case: Clusters and PCA on a One Dimensionsal Issue Space

set.seed(27680)
vote_pref_dims1 <- data.frame()

## generate voters
voters<-rnorm(n=1000,mean=0,sd=.2)
## generate candidates 
candidates <- seq(-1,1, by = .2) + rnorm(n = 11, mean = 0, sd = .4)
## produce rankings
for(val in voters){
vote_pref_dims1<-rbind(vote_pref_dims1,rank(abs(candidates-val))) }
## scale rankings 

sim_comp_dims1<-prcomp(vote_pref_dims1, scale =TRUE)

sim_vs_pr <- data.frame(cbind(voters, sim_comp_dims1$x)) %>%
  dplyr::select(voters, PC1) %>%
  mutate(voters = scales::rescale(voters, to = c(-1, 1))) %>%
  mutate(PC1 = scales::rescale(PC1, to = c(-1, 1)))

## visualize rankings 
ggplot(as.data.frame(sim_vs_pr), aes(x = PC1))+
  geom_density(fill = "lightblue") + labs(x = "preference ideal point") +
  geom_density(aes(x = voters), fill = "#FF6666", alpha = .2) + 
  theme_classic()

## Linear Regression
reg1<- lm(voters ~ PC1, data = sim_vs_pr)
stargazer(reg1, header = FALSE, type = 'html')
Dependent variable:
voters
PC1 0.640***
(0.004)
Constant 0.199***
(0.002)
Observations 1,000
R2 0.962
Adjusted R2 0.962
Residual Std. Error 0.055 (df = 998)
F Statistic 25,420.310*** (df = 1; 998)
Note: p<0.1; p<0.05; p<0.01

As the graph shows, even in the most simple case, PCA is an imperfect measure of the ideological space the rankings are derived from. Still, as the regression table shows, even if PCA is bad at predicting the structure of the ideological space, its an extremely good predictor of the positioning of individual datapoints within the space. While the correlation is far from perfect, it is very strong, and as the \(R^2\) shows, it explains almost all of the variance in the latent ideology variable.

Adding structure (clusters) to the data

## Two clusters 

vote_pref_dims1 <- data.frame()

## generate voters

# assign clusters 
cluster <- rep(1:2, times = 500)
voters <- data.frame(cluster)
voters$x <- NA
# assign position in the space based on clusters 
voters[cluster == 1,]$x<-rnorm(n = 500, mean = -.5, sd = .2)
voters[cluster == 2,]$x<-rnorm(n = 500, mean = .5, sd = .2)
## generate candidates 
candidates <- seq(-1,1, by = .2) + rnorm(n = 11, mean = 0, sd = .4)
## produce rankings
for( i in 1:length(voters$x)){
vote_pref_dims1<-rbind(vote_pref_dims1,rank(abs(candidates-voters$x[i]))) }
## scale rankings 

sim_comp_dims1<-prcomp(vote_pref_dims1, scale =TRUE)

sim_vs_pr <- data.frame(cbind(voters, sim_comp_dims1$x)) %>%
  dplyr::select(cluster, x , PC1) %>%
  mutate(x = scales::rescale(x, to = c(-1, 1))) %>%
  mutate(PC1 = scales::rescale(PC1, to = c(-1, 1)))

## visualize rankings 
ggplot(as.data.frame(sim_vs_pr), aes(x = PC1))+
  geom_density(fill = "lightblue") + labs(x = "preference ideal point") +
  geom_density(aes(x = x), fill = "#FF6666", alpha = .2) + 
  theme_classic()

## Linear Regression
reg1<- lm(x ~ PC1, data = sim_vs_pr)
stargazer(reg1, header = FALSE, type = 'html')
Dependent variable:
x
PC1 0.557***
(0.004)
Constant 0.015***
(0.004)
Observations 1,000
R2 0.941
Adjusted R2 0.941
Residual Std. Error 0.121 (df = 998)
F Statistic 15,885.130*** (df = 1; 998)
Note: p<0.1; p<0.05; p<0.01

Adding clusters causes the PCA algorithm to do a much better job of reconstructing the structure of the latent ideology space. On the other hand, the regression shows that with clusters, PCA does no better and may actually do slightly worse at predicting the positioning of specific data points than without it. What about the estimating the clusters themselves?

## KNN
set.seed(123570)
one_dim_two_clust_kmeans <-kmeans(vote_pref_dims1, centers = 2,# number of clusters
                  nstart = 100)# number of random starts
sim_actual_clust <- cbind(one_dim_two_clust_kmeans$cluster, voters) %>% 
  mutate(simulated_cluster = factor(one_dim_two_clust_kmeans$cluster, labels = c("Cluster 1", "Cluster 2"))) %>% 
  mutate(cluster = factor(cluster, labels = c("Cluster 1", "Cluster 2")))
conf_mat(data = sim_actual_clust, 
         truth = cluster,
         estimate = simulated_cluster )
##            Truth
## Prediction  Cluster 1 Cluster 2
##   Cluster 1       495         6
##   Cluster 2         5       494

As the confidence matrix above shows, the kmeans clustering algorithm predicts the two clusters within the data with around 99% accuracy. This makes sense given what we know so far: predicting the structure of the space is difficult but predicting the attributes of individual points is easier.

With two clusters, the clustering algorithm is much more accurate than the PCA. What about three clusters?

Scaling with three clusters

set.seed(129570)
## Three clusters 

vote_pref_dims1 <- data.frame()

# assign cluster
cluster <- rep(1:3, times = 333)
voters <- data.frame(cluster)
voters$x <- NA
# assign position in space based on cluster 
voters[cluster == 1,]$x<-rnorm(n = 333, mean = -1, sd = .2)
voters[cluster == 2,]$x<-rnorm(n = 333, mean = 1, sd = .2)
voters[cluster == 3,]$x<-rnorm(n = 333, mean = 0, sd = .2)


## generate candidates 
candidates <- seq(-1,1, by = .2) + rnorm(n = 11, mean = 0, sd = .4)
## produce rankings
for( i in 1:length(voters$x)){
vote_pref_dims1<-rbind(vote_pref_dims1,rank(abs(candidates-voters$x[i]))) }
## scale rankings 

sim_comp_dims1<-prcomp(vote_pref_dims1, scale =TRUE)

sim_vs_pr <- data.frame(cbind(voters, sim_comp_dims1$x)) %>%
  dplyr::select(cluster, x , PC1) %>%
  mutate(x = scales::rescale(x, to = c(-1, 1))) %>%
  mutate(PC1 = scales::rescale(PC1, to = c(-1, 1)))


## visualize rankings 
ggplot(as.data.frame(sim_vs_pr), aes(x = PC1))+
  geom_density(fill = "lightblue") + labs(x = "preference ideal point") +
  geom_density(aes(x = x), fill = "#FF6666", alpha = .2) + 
  theme_classic()

## Linear Regression
reg1<- lm(x ~ PC1, data = sim_vs_pr)
stargazer(reg1, header = FALSE, type = 'html')
Dependent variable:
x
PC1 0.615***
(0.005)
Constant 0.004
(0.004)
Observations 999
R2 0.945
Adjusted R2 0.945
Residual Std. Error 0.120 (df = 997)
F Statistic 17,276.580*** (df = 1; 997)
Note: p<0.1; p<0.05; p<0.01

As the density plot shows PCA barely captures the middle cluster, but is able to replicate at least some of the shape of the data. As the regression table shows, PCA is still doing very well at identifying the positioning of individual points.

set.seed(149870)
## KNN

one_dim_three_clust_kmeans <-kmeans(vote_pref_dims1, centers = 3,# number of clusters
                  nstart = 100)# number of random starts
sim_actual_clust <- cbind(one_dim_three_clust_kmeans$cluster, voters) %>% 
  mutate(simulated_cluster = factor(one_dim_three_clust_kmeans$cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3"))) %>% 
  mutate(cluster = factor(cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3")))
conf_mat(data = sim_actual_clust, 
         truth = cluster,
         estimate = simulated_cluster )
##            Truth
## Prediction  Cluster 1 Cluster 2 Cluster 3
##   Cluster 1       333         0         7
##   Cluster 2         0       323         1
##   Cluster 3         0        10       325

Clustering with KNN is highly accurate with three clusters as well. There is a little more bleed between clusters which makes sense given that the clusters themselves have greater overlap, but overall KNN continues to outperform PCA.

Clustering and Ideal Point Estimation in two dimensions

set.seed(094334634)

## generate voters

# assign cluster 
cluster <- rep(1:3, times = 333) 
voters <- data.frame(cluster)
voters$x <- NA
voters$y <- NA
# assign positioning in the space based on cluster in two dimensions
voters[cluster == 1,]$x<-rnorm(n = 333, mean = -1, sd = .5)
voters[cluster == 2,]$x<-rnorm(n = 333, mean = 1, sd = .5)
voters[cluster == 3,]$x<-rnorm(n = 333, mean = 0, sd = .5)
voters[cluster == 1,]$y<-rnorm(n = 333, mean = .6, sd = .5)
voters[cluster == 2,]$y<-rnorm(n = 333, mean = .6, sd = .5)
voters[cluster == 3,]$y<-rnorm(n = 333, mean = -.6, sd = .5)
voters <- voters %>% 
  mutate(cluster = factor(cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3")))
## generate candidates 

a <-seq(-1.5,1.5, by = .4) + rnorm(n = 11, mean = 0, sd = .5)
## Warning in seq(-1.5, 1.5, by = 0.4) + rnorm(n = 11, mean = 0, sd = 0.5): longer
## object length is not a multiple of shorter object length
b <-  rnorm(n = 11, mean = 0, sd = .5)
candidates <- data.frame(a, b)

## Jointly plot voters and candidates
ggplot() +
  geom_point(aes(x = x, y = y, color = cluster), data = voters) + 
  geom_point(aes(x= a, y = b), color = "black", data = candidates)

Simulating data in two dimensions is a little trickier and requires a more sophisticated approach to euclidean distance, but operates on broadly the same principal. For the purposes of two dimensional analysis, I have replaced the density plots with a scatter plot, and plotted the candidate positioning as black dots within the same space.

## produce rankings
vote_pref<-data.frame()
for(i in 1:length(voters$x)){
  vote_pref<-rbind(vote_pref,rank(pointDistance(candidates, voters[i,2:3], lonlat = FALSE))) }
## scale rankings 
sim_comp<-prcomp(vote_pref, scale =TRUE)
voters_sim<- cbind(voters,sim_comp$x)
voters_scaled <- voters_sim %>%
  dplyr::select(x ,y, PC1, PC2) %>%
  mutate(x = scales::rescale(x, to = c(-1, 1))) %>%
  mutate(y = scales::rescale(y, to = c(-1, 1))) %>%
  mutate(PC1 = scales::rescale(PC1, to = c(-1, 1))) %>%
  mutate(PC2 = scales::rescale(PC2, to = c(-1, 1)))

reg2 <- lm(x ~ PC1 + PC2, data = voters_scaled)
reg3 <- lm(y ~ PC1 + PC2, data = voters_scaled)
stargazer(reg2, reg3, header = FALSE, type = 'html')
Dependent variable:
x y
(1) (2)
PC1 0.477*** -0.099***
(0.005) (0.007)
PC2 -0.147*** -0.632***
(0.008) (0.010)
Constant -0.039*** 0.028***
(0.004) (0.005)
Observations 999 999
R2 0.902 0.809
Adjusted R2 0.902 0.809
Residual Std. Error (df = 996) 0.125 0.160
F Statistic (df = 2; 996) 4,584.324*** 2,112.973***
Note: p<0.1; p<0.05; p<0.01
ggplot(as.data.frame(voters_scaled), aes(x = PC1, y = PC2))+
  geom_point(color = "lightblue") + labs(x = "preference ideal point x", y= "preference ideal point y") +
  geom_point(aes(x = x, y = y), data = voters_scaled, color = "red", alpha = .2) + 
  theme_classic()

How does PCA perform when estimating two dimensions? The scatter plot suggests that again, the structure of the data is very imperfectly reconstructed. In particular, the circular nature of the PCA space relative to the triangular shape of the latent space is a common symptom of overfitting with ideal point variables. Distance is easiest to equally distribute around a circle, so ideal points that are imperfectly fitted frequently converge on a circular distribution. Still, while the regression shows some ‘bleed’ between the two dimensions, it demonstrates that the PC1 and PC2 variables still explain the vast majority of the variance in x and y.

## KNN
set.seed(1916934)
two_dim_three_clust_kmeans <-kmeans(vote_pref, centers = 3,# number of clusters
                  nstart = 100)# number of random starts
sim_actual_clust <- cbind(two_dim_three_clust_kmeans$cluster, voters) %>% 
  mutate(simulated_cluster = factor(two_dim_three_clust_kmeans$cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3")))
  
conf_mat(data = sim_actual_clust, 
         truth = cluster,
         estimate = simulated_cluster )
##            Truth
## Prediction  Cluster 1 Cluster 2 Cluster 3
##   Cluster 1       317         6        33
##   Cluster 2        14         8       212
##   Cluster 3         2       319        88
ggplot(as.data.frame(voters_scaled), aes(x = PC1, y = PC2))+
  geom_point(aes(color = sim_actual_clust$simulated_cluster)) + labs(x = "preference ideal point x", y= "preference ideal point y") +
  geom_point(aes(x = x, y = y, color = sim_actual_clust$cluster), data = voters_scaled, alpha = .2) + 
  theme_classic()

In two dimensions, KNN also has a lot more difficulty, but still does better than PCA. Cluster misidentification in two dimensions also reflects the way I have constructed the two dimensional space, which as you can see in the original cluster scatter plot, features a fair number of dots that are closer to another cluster than the cluster they were generated from.

In two dimensions,

Do more specialized algorithms perform better? A look at prefscale from the smacof package.

scale1<-prefscal(vote_pref,ndim=2,type="ordinal",conditionality="matrix",lambda = .5,omega = .1)
plot(scale1)

unfold <- as.data.frame(scale1$conf.row)
voters_scaled <- voters_scaled %>% 
  mutate(unfold_x = scales::rescale(unfold$D1, to = c(-1, 1))) %>%
  mutate(unfold_y = scales::rescale(unfold$D2, to = c(-1, 1))) 

reg4 <- lm(x ~ unfold_x + unfold_y, data = voters_scaled)
reg5 <- lm(y ~ unfold_x + unfold_y, data = voters_scaled)
stargazer(reg2, reg3, reg4, reg5, header = FALSE, type = 'html')
Dependent variable:
x y x y
(1) (2) (3) (4)
PC1 0.477*** -0.099***
(0.005) (0.007)
PC2 -0.147*** -0.632***
(0.008) (0.010)
unfold_x 0.560*** -0.055***
(0.005) (0.008)
unfold_y -0.023*** 0.590***
(0.006) (0.010)
Constant -0.039*** 0.028*** -0.020*** -0.014**
(0.004) (0.005) (0.003) (0.006)
Observations 999 999 999 999
R2 0.902 0.809 0.933 0.768
Adjusted R2 0.902 0.809 0.933 0.768
Residual Std. Error (df = 996) 0.125 0.160 0.103 0.177
F Statistic (df = 2; 996) 4,584.324*** 2,112.973*** 6,960.213*** 1,651.744***
Note: p<0.1; p<0.05; p<0.01

Are other ideal point estimation algorithms more accurate in recreated the initial space than PCA? To test this I use the function prefscal from the smacof package, this function is specifically designed to recover a two dimensional space from ranking data (sometimes referred to in data science as unfolding). However, as the scatter plot shows, prefscal does not do much visually better than PCA at recreating the latent space and suffers from the same circular overfitting problem. The smacof package features its own scatterplot type show above the gg scater plot which jointly projects candidate positioning and voter positioning, something PCA does not. Here we can see another interesting property of these algorithms, which is that they tend to position candidates outside of the voter space. Turning to the regression, we can see that while prefscal does a little better at predicting the underlying ideal point positions than PCA, the differences are marginal at best.

Hypothesis: Clusters are relatively easy to get right, recovery of actual ideal point space is highly sensitive to number of candidates and whether the positioning of the candidates corresponds to meaningful divisions in the data.

One theory I had running these simulations over and over again and seeing how much variance there was in the outcomes was that the accuracy of the clustering and PCA algorithms might be highly sensitive to the positioning of candidates in the space. There were much fewer candidates in my simulation than voters, but candidates had roughly the same level of variance because they needed to be evenly distributed accross the space. I reasoned that candidates could wind up randomly positioned in ways that skewed voter rankings or made them less predictive. If a candidate was randomly generated in a position where voters would find them universally undesirable, this would rob the simulation of 1/11th of its predictive power. To test whether candidate positioning was the cause of the ‘noisiness’ of the results, I reran the two dimensional simulation with 100 candidates spread widely accross the space.

## Generate a huge and divserse pool of candidates: 
a <- rnorm(n = 100, mean = 0, sd = 1)
b <-  rnorm(n = 100, mean = 0, sd = 1)
candidates <- data.frame(a, b)

ggplot() +
  geom_point(aes(x = x, y = y, color = cluster), data = voters) + 
  geom_point(aes(x= a, y = b), color = "black", data = candidates)

## produce rankings
vote_pref<-data.frame()
for(i in 1:length(voters$x)){
  vote_pref<-rbind(vote_pref,rank(pointDistance(candidates, voters[i,2:3], lonlat = FALSE))) }
## scale rankings 
sim_comp<-prcomp(vote_pref, scale =TRUE)
voters_sim<- cbind(voters,sim_comp$x)
voters_scaled <- voters_sim %>%
  dplyr::select(x ,y, PC1, PC2) %>%
  mutate(x = scales::rescale(x, to = c(-1, 1))) %>%
  mutate(y = scales::rescale(y, to = c(-1, 1))) %>%
  mutate(PC1 = scales::rescale(PC1, to = c(-1, 1))) %>%
  mutate(PC2 = scales::rescale(PC2, to = c(-1, 1)))

reg2 <- lm(x ~ PC1 + PC2, data = voters_scaled)
reg3 <- lm(y ~ PC1 + PC2, data = voters_scaled)
stargazer(reg2, reg3, header = FALSE, type = 'html')
Dependent variable:
x y
(1) (2)
PC1 0.568*** 0.113***
(0.004) (0.004)
PC2 0.179*** -0.648***
(0.005) (0.005)
Constant 0.071*** 0.101***
(0.003) (0.003)
Observations 999 999
R2 0.946 0.951
Adjusted R2 0.946 0.951
Residual Std. Error (df = 996) 0.093 0.081
F Statistic (df = 2; 996) 8,700.073*** 9,752.193***
Note: p<0.1; p<0.05; p<0.01

As the regression table above shows, the gains were far more marginal than I would have hoped had my theory been correct. While the PCA algorithm is somewhat more accurate than it is with fewer candidates the difference should be far greater given that I used 10x as many candidates if candidate positioning was the driver of the noisiness of the data.

set.seed(91395979)
## KNN

two_dim_three_clust_kmeans <-kmeans(vote_pref, centers = 3,# number of clusters
                  nstart = 100)# number of random starts
sim_actual_clust <- cbind(two_dim_three_clust_kmeans$cluster, voters) %>% 
  mutate(simulated_cluster = factor(two_dim_three_clust_kmeans$cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3")))

conf_mat(data = sim_actual_clust, 
         truth = cluster,
         estimate = simulated_cluster )
##            Truth
## Prediction  Cluster 1 Cluster 2 Cluster 3
##   Cluster 1         3       311        23
##   Cluster 2       313         6        16
##   Cluster 3        17        16       294
ggplot(as.data.frame(voters_scaled), aes(x = PC1, y = PC2))+
  geom_point(aes(color = sim_actual_clust$simulated_cluster)) + labs(x = "preference ideal point x", y= "preference ideal point y") +
  geom_point(aes(x = x, y = y, color = sim_actual_clust$cluster), data = voters_scaled, alpha = .2) + 
  theme_classic()

The gains from increasing the number of candidates for clustering are more significant, the clustering algorithm cuts its error rate roughly in half. I also tested whether prefscal performed better with more candidates.

scale2<-prefscal(vote_pref,ndim=2,type="ordinal",conditionality="matrix",lambda = .5,omega = .1)
plot(scale2)

unfold <- as.data.frame(scale2$conf.row)
voters_scaled <- voters_scaled %>% 
  mutate(unfold_x = scales::rescale(unfold$D1, to = c(-1, 1))) %>%
  mutate(unfold_y = scales::rescale(unfold$D2, to = c(-1, 1))) 

reg4 <- lm(x ~ unfold_x + unfold_y, data = voters_scaled)
reg5 <- lm(y ~ unfold_x + unfold_y, data = voters_scaled)
stargazer(reg2, reg3, reg4, reg5, header = FALSE, type = 'html')
Dependent variable:
x y x y
(1) (2) (3) (4)
PC1 0.568*** 0.113***
(0.004) (0.004)
PC2 0.179*** -0.648***
(0.005) (0.005)
unfold_x 0.607*** 0.117***
(0.004) (0.004)
unfold_y 0.145*** -0.657***
(0.005) (0.004)
Constant 0.071*** 0.101*** 0.055*** 0.102***
(0.003) (0.003) (0.003) (0.002)
Observations 999 999 999 999
R2 0.946 0.951 0.958 0.957
Adjusted R2 0.946 0.951 0.958 0.957
Residual Std. Error (df = 996) 0.093 0.081 0.082 0.076
F Statistic (df = 2; 996) 8,700.073*** 9,752.193*** 11,296.350*** 11,118.850***
Note: p<0.1; p<0.05; p<0.01
ggplot(as.data.frame(voters_scaled), aes(x = unfold_x, y = unfold_y))+
  geom_point(aes(color = sim_actual_clust$simulated_cluster)) + labs(x = "preference ideal point x", y= "preference ideal point y") +
  geom_point(aes(x = x, y = y, color = sim_actual_clust$cluster), data = voters_scaled, alpha = .2) + 
  theme_classic()

The results are very simmilar to my earlier finding that using prefscal or prcomp makes very little difference.

Clustering with redacted preferences

One direction Aaron thought might be worth pursuing was making the simulations more accurate by randomly eliminating preference information given the nature of the actual ballot data I have collected, which features many missing values. As shown below, redacting a significant portion of the preference information does not significantly damage the accuracy of the clustering algorithm.

vote_pref<-data.frame()
a <-seq(-1.5,1.5, by = .4) + rnorm(n = 11, mean = 0, sd = .5)
## Warning in seq(-1.5, 1.5, by = 0.4) + rnorm(n = 11, mean = 0, sd = 0.5): longer
## object length is not a multiple of shorter object length
b <-  rnorm(n = 11, mean = 0, sd = .5)
candidates <- data.frame(a, b)
for(i in 1:length(voters$x)){
  vote_pref<-rbind(vote_pref,rank(pointDistance(candidates, voters[i,2:3], lonlat = FALSE))) }
redact<- round(abs(rnorm(999, mean = 3, sd = 2)),0)
summary(redact)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     2.0     3.0     3.1     4.0     9.0
for (i in 1:length(vote_pref$X2)){
  count <- 0
  for (h in vote_pref[i,]){
    if(count<redact[i]){
      vote_pref[i,h] <- 12
    }
    count <- count+1
  }
}

two_dim_three_clust_kmeans <-kmeans(vote_pref, centers = 3,# number of clusters
                  nstart = 100)# number of random starts
sim_actual_clust <- cbind(two_dim_three_clust_kmeans$cluster, voters) %>% 
  mutate(simulated_cluster = factor(one_dim_three_clust_kmeans$cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3")))

conf_mat(data = sim_actual_clust, 
         truth = cluster,
         estimate = simulated_cluster )
##            Truth
## Prediction  Cluster 1 Cluster 2 Cluster 3
##   Cluster 1       333         0         7
##   Cluster 2         0       323         1
##   Cluster 3         0        10       325

Conclusions from Data Simulation Component

Overall, my experience with simulating ranked data and ideological spaces suggests three conclusions. 1) Clustering algorithms can identify latent clusters within the data with decent accuracy even given significant overlap and preference redaction. 2) PCA can identify the positioning of individual voters well but does perform well at reconstructing the underlying ideal point space, because of this, there is not a great deal of benefit of using PCA rather than clustering. 3) Basic PCA performs about as well as more specialized tools at performing these tasks. prefscal which estimates the position of both voters and candidates does about as poorly with candidates as it does with voters which is about as poor as PCA.

These findings lead me to the conclusion that I should focus on looking for underlying clusters in my ballot data, and that using PCA rather than prefscal (which is much more computationally intensive) was acceptable going forward.

Empirical Analysis: Data Import

Data for this project comes from three Democratic primaries that occured during the Summer of 2020 in which ranked choice voting was used. These primaries took place after Biden had consolidated the field and had more or less already won, so competition was less intense than it would have been in earlier primaries. Still, my hope was that this data would provide an interesting test case for the ideology space methods I have experimented above and an interesting question, what were the latent coalitions in the Democratic primary? Which candidates did voters see as similar to one another?

# Pull in the data
wyoming <- read_csv("~/Documents/Wyoming data import.csv")
AlaskaCVR <- read_excel("~/Documents/AlaskaCVR.xlsx", na = "under")
KansasCVR <- read_excel("KansasCVR.xlsx")

# relabel columns to enable pivoting

names<-colnames(wyoming)
names[6:10]<- c(1,2,3,4,5)
colnames(wyoming) <- names

# clean and relabel the candidates
wyoming_cleaned <- wyoming %>%
    dplyr::select(BallotID, `1` ,`2` ,`3` ,`4`, `5` ) %>%
    group_by(BallotID) %>%
    mutate_all(funs(recode(., '1' = "Biden", '3' = "Steyer",
                           '4' = "Sanders", '5' = "Klobuchar",
                           '7' = "Gabbard", '8' = "Warren",
                           '9' = "Bloomberg", '11' = "Buttgieg",
                           '12' = "undeclared", 'over' = NA_character_,
                           'under' = NA_character_)))

# pivot the data to preference format

wyoming_l <- wyoming_cleaned %>%
  pivot_longer(-BallotID, names_to = 'rank', values_to = "candidate",  values_drop_na = TRUE) %>%
  pivot_wider(names_from = candidate, values_from = rank, values_fn = first)

# relabel columns to enable pivoting
names<-colnames(AlaskaCVR)
names[7:11]<- c(1,2,3,4,5)
colnames(AlaskaCVR) <- names
AlaskaCVR$BallotID <- paste0(AlaskaCVR$`Tabulator Id`, AlaskaCVR$`Batch Id`, AlaskaCVR$`Record Id`)

# clean and relabel the candidates
alaska_cleaned <- AlaskaCVR %>%
    dplyr::select(BallotID, `1` ,`2` ,`3` ,`4`, `5` ) %>%
    group_by(BallotID) %>%
    mutate_all(funs(recode(., '1' = "Biden", '3' = "Steyer",
                           '4' = "Sanders", '5' = "Klobuchar",
                           '7' = "Gabbard", '8' = "Warren",
                           '9' = "Bloomberg", '11' = "Buttgieg",
                           '12' = "undeclared", 'over' = NA_character_,
                           'under' = NA_character_)))

# pivot the data to preference format
alaska_l <- alaska_cleaned %>% 
  pivot_longer(-BallotID, names_to = 'rank', values_to = "candidate",  values_drop_na = TRUE) %>%
  pivot_wider(names_from = candidate, values_from = rank, values_fn = first)

# relabel columns to enable pivoting

names<-colnames(KansasCVR)
names[2:6]<- c(1,2,3,4,5)
colnames(KansasCVR) <- names

# clean and relabel the candidates

kansas_cleaned <- KansasCVR %>%
    dplyr::select(BallotID, `1` ,`2` ,`3` ,`4`, `5` ) %>%
    group_by(BallotID) %>%
    mutate_all(funs(recode(., '1' = "Biden", '3' = "Steyer",
                           '4' = "Sanders", '5' = "Klobuchar",
                           '7' = "Gabbard", '8' = "Warren",
                           '9' = "Bloomberg", '11' = "Buttgieg",
                           '12' = "undeclared", 'over' = NA_character_,
                           'under' = NA_character_)))

# pivot the data to preference format

kansas_l <- kansas_cleaned %>%
  pivot_longer(-BallotID, names_to = 'rank', values_to = "candidate",  values_drop_na = TRUE) %>%
  pivot_wider(names_from = candidate, values_from = rank, values_fn = first)

# To do: Losing observations between cleaned and long versions, why? 

Jointly Clustering and Scaling the Ballot Data

Having converted all the data to the same format, I label the rows by source, merge them into a single dataset, and run prcomp.

## Add a variable for state name
alaska_l$state<- 'alaska'
wyoming_l$state<- 'wyoming'
kansas_l$state<- 'kansas'

## Bind state dataframes together
all_l<- rbind(alaska_l, kansas_l, wyoming_l)
all_matrix <- all_l[,2:10] %>%
      mapply(FUN=as.numeric)

all_cleaned <- rbind(alaska_cleaned, kansas_cleaned, wyoming_cleaned)

all_mat <- all_l[,2:10] %>%
    mutate_all(~replace(., is.na(.), 6)) %>%
    mapply(FUN=as.numeric)
principle_components_all <- all_mat%>%prcomp()
# Also failed to run with 128GB ram

# pref_scale_all <- all_matrix %>% prefscal(ndim=2,type="ordinal",conditionality="matrix")
all_pca <-bind_cols(all_l,as_tibble(principle_components_all$x)) #, as.tibble(pref_scale_all$conf.row))
ggplot(data = all_pca) +
  geom_density(aes(x = PC1, fill = state),alpha = .4) + 
  theme_minimal()

Without clustering the data or labeling the candidates. This density plot already tells us a few things about the nature of these primaries. First of all, all of them were won by Biden, which suggested that the giant peaks at the left hand side of the graph are voters who ranked Biden and no one else. Second, given that Sanders came in second in all of these states and that Sanders supporters were more likely to rank other candidates, the peaks in the middle of the chart likely correspond to Sanders first Biden second or Biden second Sanders first on the center left and Sanders voters who ranked other candidates such as Warren second on the center right. The right tail of the distribution is likely composed of also rans. Do the clusters generated by knn confirm to these guesses?

Clustering All: How Many Clusters

First, I need to decide how many clusters to use. These cluster selection algorithms wound up crashing even after spinning up a 128 GB Ram AWS machine (screenshots attached) to run them, so I wound up selecting the number of clusters by sampling the data instead. 8 was used as a cutoff for number of clusters because part of the utility of the clusters for this application is for there to be fewer clusters than candidates.

## This takes way too long, even in AWS, wound up going with the backup plan of sampling the full dataset
library(factoextra)
 fviz_nbclust(sample_n(as.data.frame(all_mat), 3000), FUN = kmeans, method = "wss", k.max = 8)

 fviz_nbclust(sample_n(as.data.frame(all_mat), 3000), FUN = kmeans, method = "gap_stat", k.max = 8)

 fviz_nbclust(sample_n(as.data.frame(all_mat), 3000), FUN = kmeans, method = "silhouette", k.max = 8)

All three cluster selection algorithms suggest that 6 is a good number of clusters to use, so that is what I went with. AWS terminal top AWS terminal top AWS terminal top Why do some of these tools not work even with 16 cores (well the cores do not help that much) and 128 GBs of RAM? Some of the errors I ran into indicated that they might have worked with 256 GB of RAM, but I did not want to risk spinning up an AWS instance that big for something that time consuming and have it not work. These tools take up so much memory by scanning a larger and larger possibility space trying to arrive at a specification where their error stat converges towards a global minima, so one possibility given how much RAM these were already trying to use is that such a minima does not exist for how I am setting them up.

all_cluster<-all_pca
# predict six clusters
all_kmeans <-kmeans(all_mat,centers = 6,# number of clusters
                  nstart = 100)# number of random starts

all_cluster$cluster<-all_kmeans$cluster

all_pca<-left_join(all_cluster, all_cleaned, by = 'BallotID' )
table1 <- all_pca%>%
  group_by(`1`, cluster) %>%
  summarize(count=n()) %>%
  arrange(desc(count)) %>%
  spread(`1`,count)
## `summarise()` regrouping output by '1' (override with `.groups` argument)
kable(table1)
cluster Biden Bloomberg Buttgieg Gabbard Klobuchar Sanders Steyer undeclared Warren
1 16789 53 181 91 113 21 20 NA 2549 54
2 45537 13 19 151 3 11497 2 NA 7043 8
3 42023 81 140 112 68 2 30 NA 1 62
4 3564 11 11 623 13 659 6 4448 84 147
5 14764 68 93 96 26 11191 33 NA 2 54
6 67 24 96 774 47 14595 20 37 3900 48

What can the cluster algorithm tell us about the underlying coalitions in the ballot data? 1) Cluster 1: Primarily composed of Biden-Warren voters, this makes up a relatively small chunk of Biden’s base, and a larger but still relatively small chunk of Warren’s base. Given that Warren supporters were percieved as ideologically similar to Sanders supporters but frequently clashed for other reasons, we can think of this cluster as Biden voters who liked Warren and Warren voters who disliked Sanders.

  1. Cluster 2: This cluster is Biden-Sanders-Warren voters and makes up the bulk of Warren’s base and about a third of Sanders’ and Biden’s bases respectively. It is also the largest cluster overall. We can think of this cluster as ‘team player’ Democrats who liked all of the major candidates and just wanted them to get along.

  2. Cluster 3: This cluster is clearly composed of Biden-only or Biden-minor-candidate voters. It is also the second largest cluster. Given the presence of Cluster 2, these voters are probably just Democrats who did not see the utility of ranking more candidates given that Biden was going to win, not Democrats who were deeply hostile to the rest of the field.

  3. Cluster 4: This is a very small cluster composed mostly of voters for undeclared candidates. I do not know of any major write-in campaigns within the Democratic primary, but maybe Kanye got some voters, and the rest are ‘Biden is going to win, just going to write-in someone funny’ voters.

  4. Cluster 5: This cluster is composed of Biden and Sanders but not Warren voters. These voters may have been motivated by sexism or disagreeements that occured between Sanders and Warren and Biden and Warren, or they may just have not seen the utility of ranking Warren when she was so far behind.

  5. Cluster 6: In the same way that Cluster 3 is the Biden-only cluster, this one is very clearly the Sanders-only and Sanders-Warren cluster. Given that Sanders had very little hope of winning at this point. Voting only for Sanders or for Sanders and Warren was a clear signal of disatisfaction with the likely nomination of Biden.

ggplot(data = all_cluster) +
  geom_point(aes(x = PC1, y =PC2, color = factor(cluster)),alpha = .4) + 
  theme_minimal()

ggplot(data = all_cluster) +
  geom_density(aes(x = PC1, fill = factor(cluster)),alpha = .4) + 
  theme_minimal()

What does all of this tell us? Not a ton that is not obvious, but I came away from this clustering case study with a few takeaways.

  1. Most Democrats liked all of the major candidates.
  2. The perception that Sanders and Warren were, depending on how you think about it either 1) part of the same coalition, or 2) competing for the same set of voters, is broadly born out by the data. If future primaries expand the use of ranked choice voting, progressive candidate may be able to run coalition strategies to maximize their vote share rather than tearing each other apart.
  3. There are more write-in voters than I would have expected and write-in voters are a more statistically anomalous group than I would have thought.

While the utility of clustering as a tool for ranked data would be clearer if we were trying to compare across voting systems or had another dataset we wanted to merge onto (like census data or a precinct map), overall I think this exercise shows that the clusters would likely have interesting implications for such a study, and that the clusters are interesting in and of themselves.